Abstract

We analyze a new algorithm for probability forecasting of binary obser- 
vations on the basis of the available data, without making any assumptions 
about the way the observations are generated. The algorithm is shown to 
be well calibrated and to have good resolution for long enough sequences 
of observations and for a suitable choice of its parameter, a kernel on the 
Cartesian product of the forecast space [0, 1] and the data space. Our 
main results are non-asymptotic: we establish explicit inequalities, shown 
to be tight, for the performance of the algorithm. 



1 Introduction 

We consider the problem of forecasting a new observation from the available 
data, which may include, e.g., all or some of the previous observations and the 
values of some explanatory variables. To make the process of forecasting more 
vivid, we imagine that the data and observations are chosen by a player called 
Reality and the forecasts are made by a player called Forecaster. To establish 
properties of forecasting algorithms, the traditional theory of machine learn- 
ing makes some assumptions about the way Reality generates the observations; 
e.g., statistical learning theory assumes that the data and observations are 
generated independently from the same probability distribution. A more recent 
approach, prediction with expert advice (see, e.g., [Sj), replaces the assump- 
tions about Reality by a comparison class of prediction strategies; a typical 
result of this theory asserts that Forecaster can perform almost as well as the 
best strategies in the comparison class. This paper further explores a third 
possibility, suggested in [U], which requires neither assumptions about Reality 
nor a comparison class of Forecaster's strategies. It is shown in that there 
exists a forecasting strategy which is automatically well calibrated; this result 
has been further developed in, e.g., ^H20]. Almost all known calibration
results, however, are asymptotic (see [221 and EH for a critique of the standard
asymptotic notion of calibration); a non-asymptotic result about calibration is
given in [19], Proposition 2, but even this result involves unspecified constants
and randomization. The main results of this paper (Theorems 1 and 2) establish
simple explicit inequalities characterizing calibration and resolution of our
deterministic forecasting algorithm.

Next we briefly describe the main features of our proof techniques and their
connections with the literature. The proofs rely on the game-theoretic approach
to probability. The forecasting protocol is complemented by
another player, Skeptic, whose role is to gamble at the odds given by Forecaster's
probabilities. It can be said that our approach to forecasting is Skeptic-based, 
whereas the traditional approach is Reality-based and prediction with expert ad- 
vice is Forecaster-based. The two most popular formalizations of gambling are 
subsequence selection rules (going back to von Mises's collectives) and martin- 
gales (going back to Ville's critique |2H] of von Mises's collectives and described 
in detail in [2U). The pioneering paper on what we call the Skeptic-based 
approach, as well as the numerous papers developing it, used von Mises's notion 
of gambling; j^H] appears to be the first paper in this direction to use Ville's 
notion of gambling. Another ingredient of this paper's approach, considering
Skeptic's continuous strategies and thus avoiding randomization by Forecaster
(which was the standard feature of the previous work), goes back to ^5] and is
also described in [12]; however, I learned it from Akimichi Takemura in June
2004 (whose observation was prompted by Glenn Shafer's talk at the University
of Tokyo).

It should be noted that, although our approach was inspired by JT] and 
papers further developing precise statements of our results and our proof 
techniques are completely different: they are more in the spirit of Levin's ^11 
result about the existence of neutral measures (see for details). 

This version (version 4) of this technical report differs from the previous 
one in that it incorporates the changes made in response to the comments of 
the reviewers of its journal version (to be published in Theoretical Computer 
Science). 

2 The algorithms of large numbers 

In this section we describe our learning protocol and the general forecasting 
algorithm studied in this paper. The protocol is: 

FOR $n = 1, 2, \ldots$:
  Reality I announces $x_n \in X$.
  Forecaster announces $p_n \in [0, 1]$.
  Reality II announces $y_n \in \{0, 1\}$.
END FOR.

On each round, Reality chooses the datum $x_n$, then Forecaster gives his forecast
$p_n$ for the next observation, and finally Reality discloses the actual observation
$y_n \in \{0, 1\}$. Reality chooses $x_n$ from a data space $X$ and $y_n$ from the two-element
set $\{0, 1\}$; intuitively, Forecaster's move $p_n$ is the probability he attaches to the
event $y_n = 1$. A forecasting algorithm is Forecaster's strategy in this protocol.
For convenience in stating the results of §6 we split Reality into two players,
Reality I and Reality II.

Our learning protocol is a perfect-information protocol; in particular, Reality
may take into account the forecast $p_n$ when deciding on her move $y_n$.
(This feature is unusual for probability forecasting but it extends the domain of
applicability of our results and we have it for free.)

Next we describe the general forecasting algorithm that we study in this
paper (it was derived informally in E3]). A function $K : Z^2 \to \mathbb{R}$, where
$Z$ is an arbitrary set and $\mathbb{R}$ is the set of real numbers, is a kernel on $Z$ if
it is symmetric ($K(z, z') = K(z', z)$ for all $z, z' \in Z$) and positive definite
($\sum_{i=1}^{m} \sum_{j=1}^{m} \lambda_i \lambda_j K(z_i, z_j) \ge 0$ for all
$(\lambda_1, \ldots, \lambda_m) \in \mathbb{R}^m$ and all $(z_1, \ldots, z_m) \in Z^m$).
The usual interpretation of a kernel $K(z, z')$ is as a measure of similarity
between $z$ and $z'$ (see, e.g., [221, §1.1). Our algorithm has one parameter, which
is a kernel on the Cartesian product $[0, 1] \times X$. The most straightforward way of
constructing such kernels from kernels on $[0, 1]$ and kernels on $X$ is the operation
of tensor product. (See, e.g., [j2|SSl|23].) Let us say that a kernel $K$ on $[0, 1] \times X$
is forecast-continuous if the function $K((p, x), (p', x'))$, where $p, p' \in [0, 1]$ and
$x, x' \in X$, is continuous in $(p, p')$ for any fixed $(x, x') \in X^2$.

K29* ALGORITHM

Parameter: forecast-continuous kernel $K$ on $[0, 1] \times X$
FOR $n = 1, 2, \ldots$:
  Read $x_n \in X$.
  Set $S_n(p) := \sum_{i=1}^{n-1} K((p, x_n), (p_i, x_i)) (y_i - p_i) + \frac{1}{2} K((p, x_n), (p, x_n)) (1 - 2p)$
    for $p \in [0, 1]$.
  If $\operatorname{sign} S_n(0) = \operatorname{sign} S_n(1) \ne 0$, output $p_n := (1 + \operatorname{sign} S_n(0))/2$;
  otherwise, output any root $p$ of $S_n(p) = 0$ as $p_n$.
  Read $y_n \in \{0, 1\}$.
END FOR.

(Since the function $S_n(p)$ is continuous, the equation $S_n(p) = 0$ indeed has
a solution when $\operatorname{sign} S_n(0) = \operatorname{sign} S_n(1) \ne 0$ does not hold; remember that
$\operatorname{sign} S$ is $1$ for $S$ positive, $-1$ for $S$ negative, and $0$ for $S = 0$.) The main term
in the expression for $S_n(p)$ is $\sum_{i=1}^{n-1} K((p, x_n), (p_i, x_i)) (y_i - p_i)$. Ignoring the
other term for a moment, we can describe the intuition behind this algorithm
by saying that $p_n$ is chosen so that $p_i$ are unbiased forecasts for $y_i$ on the
rounds $i = 1, \ldots, n - 1$ for which $(p_i, x_i)$ is similar to $(p_n, x_n)$. The term
$\frac{1}{2} K((p, x_n), (p, x_n)) (1 - 2p)$, which can be rewritten as $K((p, x_n), (p, x_n)) (0.5 - p)$,
adds an element of regularization, i.e., bias towards the "neutral" value
$p_n = 0.5$.

The K29* algorithm requires solving the equation $S_n(p) = 0$, but this can
be easily done using the bisection method or one of the numerous more
sophisticated methods (see, e.g., Chapter 9).
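The steps of K29* can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are scalars, the kernel is a Gaussian kernel with an arbitrarily chosen width, Reality is simulated by a simple contrarian rule, and all function names (`k29_star_forecast`, `find_root`, and so on) are hypothetical. Because bisection solves $S_n(p) = 0$ only approximately, the inequality of Theorem 1 is checked up to a small numerical slack.

```python
import math


def sign(v):
    """Return 1, -1 or 0 according to the sign of v."""
    return (v > 0) - (v < 0)


def gaussian_kernel(a, b, width=0.5):
    """Gaussian kernel on pairs (p, x) with scalar x; the width is an arbitrary choice."""
    (p, x), (q, z) = a, b
    return math.exp(-((p - q) ** 2 + (x - z) ** 2) / (2 * width ** 2))


def find_root(s, tol=1e-12):
    """Bisection on [0, 1] for a continuous s whose signs at 0 and 1 are not equal and nonzero."""
    lo, hi = 0.0, 1.0
    s_lo = s(lo)
    if s_lo == 0:
        return lo
    if s(hi) == 0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        s_mid = s(mid)
        if s_mid == 0:
            return mid
        if sign(s_mid) == sign(s_lo):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def k29_star_forecast(kernel, history, x_n):
    """One round of K29*; history holds (p_i, x_i, y_i) from the earlier rounds."""
    def s(p):
        main = sum(kernel((p, x_n), (p_i, x_i)) * (y_i - p_i) for p_i, x_i, y_i in history)
        return main + 0.5 * kernel((p, x_n), (p, x_n)) * (1 - 2 * p)

    s0, s1 = sign(s(0.0)), sign(s(1.0))
    if s0 == s1 != 0:
        return (1 + s0) / 2
    return find_root(s)


def run(kernel, xs, reality):
    """Play the protocol: reality(p_n, x_n) returns the observation y_n."""
    history = []
    for x_n in xs:
        p_n = k29_star_forecast(kernel, history, x_n)
        history.append((p_n, x_n, reality(p_n, x_n)))
    return history


def both_sides_of_bound(kernel, history):
    """Left and right sides of the inequality of Theorem 1, written through the kernel."""
    lhs = sum(kernel((p, x), (q, z)) * (y - p) * (w - q)
              for p, x, y in history for q, z, w in history)
    rhs = sum(p * (1 - p) * kernel((p, x), (p, x)) for p, x, _ in history)
    return lhs, rhs


# Data: the parity of the round; Reality: contradict the forecast whenever possible.
history = run(gaussian_kernel, [n % 2 for n in range(60)], lambda p, x: 1 if p < 0.5 else 0)
lhs, rhs = both_sides_of_bound(gaussian_kernel, history)
```

The contrarian Reality is used here because the protocol allows her to see $p_n$ before choosing $y_n$; any other rule for choosing the observations could be passed to `run` instead.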

It is well known (see ^U], Theorem II.3.1, for a simple proof) that there
exists a function $\Phi : [0, 1] \times X \to \mathcal{H}$ (a feature mapping taking values in a
Hilbert space¹ $\mathcal{H}$ called the feature space) such that

\[
  K(a, b) = \langle \Phi(a), \Phi(b) \rangle_{\mathcal{H}}, \quad \forall a, b \in [0, 1] \times X \tag{1}
\]

($\langle \cdot, \cdot \rangle_{\mathcal{H}}$ standing for the inner product in $\mathcal{H}$). It is known that, for any $K$ and
$\Phi$ connected by (1), $K$ is forecast-continuous if and only if $\Phi$ is a continuous
function of $p$ for each fixed $x \in X$ (see Appendix B).

Now we can state the basic result about K29* (proved in Appendix A).

Theorem 1 Let $K$ be the kernel defined by (1) for a feature mapping $\Phi : [0, 1] \times
X \to \mathcal{H}$ continuous in its first argument. The K29* algorithm with parameter
$K$ ensures

\[
  \left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2
  \le \sum_{n=1}^{N} p_n (1 - p_n) \left\| \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2,
  \quad \forall N \in \{1, 2, \ldots\}. \tag{2}
\]

Let us assume, for simplicity, that

\[
  c_K := \sup_{p, x} \left\| \Phi(p, x) \right\|_{\mathcal{H}} < \infty \tag{3}
\]



(it is often a good idea to use kernels with $\|\Phi(p, x)\|_{\mathcal{H}} = 1$ and, therefore,
$c_K = 1$). Equation (2) then implies

\[
  \left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}}
  \le \frac{c_K}{2} \sqrt{N},
  \quad \forall N \in \{1, 2, \ldots\}. \tag{4}
\]

When $\Phi$ is absent (in the sense $\Phi \equiv 1$), this shows that the forecasts $p_n$ are
unbiased, in the sense that they are close to $y_n$ on average; the presence of $\Phi$
implies, for a suitable kernel, "local unbiasedness". This is further discussed in
the first part of §5.

In the conference version |3J of this paper we also considered the K29
algorithm, which differs from K29* in that $S_n(p)$ is defined as

\[
  S_n(p) := \sum_{i=1}^{n-1} K((p, x_n), (p_i, x_i)) (y_i - p_i)
\]

and that the requirement that $K$ should be forecast-continuous is slightly relaxed
(the joint continuity in $(p, p')$ is replaced by the separate continuity in $p$ and
$p'$). For the K29 algorithm, the inequality (2) continues to hold if $p_n (1 - p_n)$ is
removed; therefore, (4) continues to hold if the denominator 2 is removed. We
will sometimes use "algorithms of large numbers" as a generic name for the K29
and K29* algorithms; the motivation for these names is that the main properties
of these algorithms are easy corollaries of Kolmogorov's 1929 proof JHj of the
weak law of large numbers.

1 Hilbert spaces in this paper are allowed to be non-separable or finite dimensional; we, 
however, always assume that their dimension is at least 1. 
3 Reproducing kernel Hilbert spaces



A reproducing kernel Hilbert space (RKHS) on a set $Z$ is a Hilbert space $\mathcal{F}$ of
real-valued functions on $Z$ such that the evaluation functional $f \in \mathcal{F} \mapsto f(z)$ is
continuous for each $z \in Z$. By the Riesz-Fischer theorem, for each $z \in Z$ there
exists a function $K_z \in \mathcal{F}$ such that

\[
  f(z) = \langle K_z, f \rangle_{\mathcal{F}}, \quad \forall f \in \mathcal{F}. \tag{5}
\]

The kernel of RKHS $\mathcal{F}$ is

\[
  K(z, z') := \langle K_z, K_{z'} \rangle_{\mathcal{F}} \tag{6}
\]

(equivalently, we could define $K(z, z')$ as $K_z(z')$ or as $K_{z'}(z)$). Since (6) is a
special case of (1), the function $K$ defined by (6) is indeed a kernel on $Z$, as
defined earlier. On the other hand, for every kernel $K$ on $Z$ there exists a unique
RKHS $\mathcal{F}$ on $Z$ such that $K$ is the kernel of $\mathcal{F}$ (see, e.g., Theorem 2).

A long list of RKHS and the corresponding kernels is given in |3J, §7.4.
Perhaps the most interesting RKHS in our current context are various Sobolev
spaces $W^{m,p}(\Omega)$ (|T] is the standard reference for the latter). We will be
interested in the especially simple space $W^{1,2}([0, 1])$, to be defined shortly; but
first let us make a brief terminological remark. The term "Sobolev space" is
usually treated as the name for a topological vector space. All these spaces are
normable, but different norms are not considered to lead to different Sobolev
spaces as long as the topology does not change.

The Fermi-Sobolev norm $\|f\|_{FS}$ of a smooth function $f : [0, 1] \to \mathbb{R}$ is defined
by

\[
  \|f\|_{FS}^2 := \left( \int_0^1 f(t) \, dt \right)^2 + \int_0^1 |f'(t)|^2 \, dt. \tag{7}
\]

The Fermi-Sobolev space on $[0, 1]$ is the completion of the set of smooth
$f : [0, 1] \to \mathbb{R}$ satisfying $\|f\|_{FS} < \infty$ with respect to the norm $\|\cdot\|_{FS}$. It is easy
to see that it is in fact an RKHS (indeed, if $\|f\|_{FS} = c < \infty$, the mean of $f$
is bounded by $c$ in absolute value and $|f(b) - f(a)| \le \int_a^b |f'(t)| \, dt \le c$ for all
$0 \le a < b \le 1$). As a topological vector space, it coincides with the Sobolev
space $W^{1,2}([0, 1])$. The Fermi-Sobolev space on $[0, 1]^k$ is the tensor product of
$k$ copies of the Fermi-Sobolev space on $[0, 1]$.

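The kernel of the Fermi-Sobolev space on $[0, 1]$ was found in |S] (see also
§10.2); it is given by

\[
\begin{aligned}
  K(t, t') &= k_0(t) k_0(t') + k_1(t) k_1(t') + k_2(|t - t'|) \\
           &= 1 + \left( t - \tfrac{1}{2} \right) \left( t' - \tfrac{1}{2} \right)
              + \tfrac{1}{2} \left( |t - t'|^2 - |t - t'| + \tfrac{1}{6} \right) \\
           &= \tfrac{1}{2} \min\nolimits^2(t, t') + \tfrac{1}{2} \min\nolimits^2(1 - t, 1 - t') + \tfrac{5}{6},
\end{aligned} \tag{8}
\]

where $k_l := B_l / l!$ are scaled Bernoulli polynomials $B_l$. We will derive the final
expression for $K(t, t')$ in (8) in Appendix C. For the Fermi-Sobolev space on
$[0, 1]^k$ we have

\[
  K((t_1, \ldots, t_k), (t'_1, \ldots, t'_k))
  = \prod_{i=1}^{k} \left( \tfrac{1}{2} \min\nolimits^2(t_i, t'_i) + \tfrac{1}{2} \min\nolimits^2(1 - t_i, 1 - t'_i) + \tfrac{5}{6} \right) \tag{9}
\]

and, therefore,

\[
  c_K^2 = \left( \max_{t \in [0, 1]} \left( \tfrac{1}{2} t^2 + \tfrac{1}{2} (1 - t)^2 + \tfrac{5}{6} \right) \right)^k
        = \left( \tfrac{4}{3} \right)^k. \tag{10}
\]

For further information about the Fermi-Sobolev spaces, see [31].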
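The equality of the two expressions for the Fermi-Sobolev kernel, and the value of its maximum on the diagonal, can be checked numerically. The sketch below assumes the standard Bernoulli polynomials $B_0 = 1$, $B_1(t) = t - 1/2$, $B_2(t) = t^2 - t + 1/6$; the function names are hypothetical.

```python
import math
import random


def fs_kernel_bernoulli(t, u):
    """Kernel written through the scaled Bernoulli polynomials k_l = B_l / l!."""
    k1_t, k1_u = t - 0.5, u - 0.5                 # k_1 = B_1
    d = abs(t - u)
    k2 = (d * d - d + 1.0 / 6.0) / 2.0            # k_2 = B_2 / 2
    return 1.0 + k1_t * k1_u + k2                 # k_0 = B_0 = 1


def fs_kernel_min(t, u):
    """The same kernel in the closed form with minima."""
    return 0.5 * min(t, u) ** 2 + 0.5 * min(1 - t, 1 - u) ** 2 + 5.0 / 6.0


def fs_kernel_product(ts, us):
    """Tensor-product kernel on [0, 1]^k."""
    return math.prod(fs_kernel_min(t, u) for t, u in zip(ts, us))


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
max_gap = max(abs(fs_kernel_bernoulli(t, u) - fs_kernel_min(t, u)) for t, u in points)

# Largest value of K(t, t) on a grid of [0, 1]; it is attained at the endpoints.
diag_max = max(fs_kernel_min(i / 1000, i / 1000) for i in range(1001))

# A corner of the cube [0, 1]^3, where each factor equals its maximum.
corner = fs_kernel_product((0.0, 1.0, 0.0), (0.0, 1.0, 0.0))
```

On the diagonal both forms reduce to $t^2 - t + 4/3$, which is why the maximum over $[0, 1]$ is $4/3$ and the product kernel on $[0, 1]^k$ is bounded by $(4/3)^k$.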

4 The K29* algorithm in RKHS 

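We can now deduce the following corollary from Theorem 1.

Theorem 2 Let $\mathcal{F}$ be an RKHS on $[0, 1] \times X$ with a forecast-continuous kernel
$K$. The K29* algorithm with parameter $K$ ensures

\[
  \left| \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n) \right|
  \le \|f\|_{\mathcal{F}} \sqrt{ \sum_{n=1}^{N} p_n (1 - p_n) K((p_n, x_n), (p_n, x_n)) } \tag{11}
\]

for all $N$ and all $f \in \mathcal{F}$.

Proof Applying K29* to the feature mapping $(p, x) \in [0, 1] \times X \mapsto K_{p,x} \in \mathcal{F}$
and using (2), we obtain, for any $f \in \mathcal{F}$,

\[
\begin{aligned}
  \left| \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n) \right|
  &= \left| \sum_{n=1}^{N} (y_n - p_n) \langle K_{p_n, x_n}, f \rangle_{\mathcal{F}} \right|
   = \left| \left\langle \sum_{n=1}^{N} (y_n - p_n) K_{p_n, x_n}, f \right\rangle_{\mathcal{F}} \right| \\
  &\le \left\| \sum_{n=1}^{N} (y_n - p_n) K_{p_n, x_n} \right\|_{\mathcal{F}} \|f\|_{\mathcal{F}}
   \le \|f\|_{\mathcal{F}} \sqrt{ \sum_{n=1}^{N} p_n (1 - p_n) K((p_n, x_n), (p_n, x_n)) }.
\end{aligned}
\]

When $c_K$ in (3) is finite, (11) implies

\[
  \left| \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n) \right|
  \le \frac{c_K}{2} \|f\|_{\mathcal{F}} \sqrt{N}, \quad \forall N. \tag{12}
\]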

5 Informal discussion 



In this section we explain why the inequalities in Theorems 1 and 2 can be
interpreted as results about calibration and resolution, and then briefly discuss
a puzzling aspect of the algorithms of large numbers. For concreteness, we
usually talk about the K29* algorithm, but all we say can also be applied, with
obvious modifications, to K29.



Calibration, resolution, and calibration-cum-resolution 

We start from the intuitive notion of calibration (for further details, see [5]
and ^J). The forecasts $p_n$, $n = 1, \ldots, N$, are said to be "well calibrated" (or
"unbiased in the small", or "reliable", or "valid") if, for any $p^* \in [0, 1]$,

\[
  \frac{\sum_{n = 1, \ldots, N : p_n \approx p^*} y_n}{\sum_{n = 1, \ldots, N : p_n \approx p^*} 1} \approx p^* \tag{13}
\]

provided $\sum_{n = 1, \ldots, N : p_n \approx p^*} 1$ is not too small. The interpretation of (13) is that
the forecasts should be in agreement with the observed frequencies. It will be
convenient to rewrite (13) as

\[
  \frac{\sum_{n = 1, \ldots, N : p_n \approx p^*} (y_n - p_n)}{\sum_{n = 1, \ldots, N : p_n \approx p^*} 1} \approx 0. \tag{14}
\]

The fact that good calibration is only a necessary condition for good
forecasting performance can be seen from the following standard example [51 lll|:
if

\[
  (y_1, y_2, y_3, y_4, \ldots) = (1, 0, 1, 0, \ldots),
\]

the forecasts $p_n = 1/2$, $n = 1, 2, \ldots$, are well calibrated but rather poor; it would
be better to forecast with

\[
  (p_1, p_2, p_3, p_4, \ldots) = (1, 0, 1, 0, \ldots).
\]

Assuming that each datum $x_n$ contains the information about the parity of
$n$ (which can always be added to $x_n$), we can see that the problem with the
forecasting strategy $p_n = 1/2$ is its lack of resolution: it does not distinguish
between the data with odd and even $n$. In general, we would like each forecast
$p_n$ to be as specific as possible to the current datum $x_n$; the resolution of a
forecasting algorithm is the degree to which it achieves this goal (taking it for
granted that $x_n$ contains all relevant information).
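The alternating example can be made concrete with a few lines of Python. The sketch below is only an illustration: it takes the datum to be the parity of the round, and the helper `average_bias` is a hypothetical name for the average of $y_n - p_n$ over a chosen set of rounds.

```python
def average_bias(forecasts, outcomes, rounds):
    """Mean of y_n - p_n over the selected rounds (1-based round numbers)."""
    picked = [outcomes[n - 1] - forecasts[n - 1] for n in rounds]
    return sum(picked) / len(picked)


N = 1000
ys = [n % 2 for n in range(1, N + 1)]     # y_n = 1 for odd n, 0 for even n
constant = [0.5] * N                      # the forecasts p_n = 1/2
sharp = [float(y) for y in ys]            # the forecasts p_n = y_n

all_rounds = range(1, N + 1)
odd_rounds = range(1, N + 1, 2)
even_rounds = range(2, N + 1, 2)

overall = average_bias(constant, ys, all_rounds)      # calibration of p_n = 1/2
on_odd = average_bias(constant, ys, odd_rounds)       # bias on the rounds with odd n
on_even = average_bias(constant, ys, even_rounds)     # bias on the rounds with even n
sharp_on_odd = average_bias(sharp, ys, odd_rounds)    # the sharper forecasts have no bias
```

Averaged over all rounds the constant forecasts are unbiased, but once the rounds are grouped by the datum the bias is $+1/2$ on odd rounds and $-1/2$ on even rounds, which is exactly the lack of resolution.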

Analogously to (14), the forecasts $p_n$, $n = 1, \ldots, N$, may be said to have
good resolution if, for any $x^* \in X$,

\[
  \frac{\sum_{n = 1, \ldots, N : x_n \approx x^*} (y_n - p_n)}{\sum_{n = 1, \ldots, N : x_n \approx x^*} 1} \approx 0
\]

provided the denominator is not too small. We can also require that the
forecasts $p_n$, $n = 1, \ldots, N$, should have good "calibration-cum-resolution": for any
$(p^*, x^*) \in [0, 1] \times X$,

\[
  \frac{\sum_{n = 1, \ldots, N : (p_n, x_n) \approx (p^*, x^*)} (y_n - p_n)}{\sum_{n = 1, \ldots, N : (p_n, x_n) \approx (p^*, x^*)} 1} \approx 0
\]

provided the denominator is not too small. Notice that even if forecasts have
both good calibration and good resolution, they can still have poor
calibration-cum-resolution.

It is easy to see that (4) implies good calibration-cum-resolution for a suitable
$\Phi$ and large $N$: indeed, (4) shows that the forecasts $p_n$ are unbiased in the
neighborhood of each $(p^*, x^*)$ for functions $\Phi$ that map distant $(p, x)$ and $(p', x')$
to almost orthogonal elements of the feature space (such as $\Phi$ corresponding to
the Gaussian kernel

\[
  K((p, x), (p', x')) := \exp\left( - \frac{(p - p')^2 + \|x - x'\|^2}{2 \sigma^2} \right) \tag{15}
\]

for a small "kernel width" $\sigma > 0$).

In general, to make sense of the $\approx$ in the numerator and denominator of,
say, (14), we replace each "crisp" point $p^*$ by a "fuzzy point" $I_{p^*} : [0, 1] \to [0, 1]$;
$I_{p^*}$ is required to be continuous, and we might also want to have $I_{p^*}(p^*) = 1$
and $I_{p^*}(p) = 0$ for all $p$ outside a small neighborhood of $p^*$. The alternative
of choosing $I_{p^*} := I_{[p_-, p_+]}$, where $[p_-, p_+]$ is a short interval containing $p^*$ and
$I_{[p_-, p_+]}$ is its indicator function, does not work because of Oakes's and Dawid's
examples |17l l8*]; $I_{p^*}$ can, however, be arbitrarily close to $I_{[p_-, p_+]}$.

Consider, e.g., the following approximation to the indicator function of a
short interval $[p_-, p_+]$ containing $p^*$:

\[
  f(p) :=
  \begin{cases}
    1 & \text{if } p_- + \epsilon \le p \le p_+ - \epsilon \\
    0 & \text{if } p \le p_- - \epsilon \text{ or } p \ge p_+ + \epsilon \\
    \frac{1}{2} + \frac{1}{2\epsilon} (p - p_-) & \text{if } p_- - \epsilon \le p \le p_- + \epsilon \\
    \frac{1}{2} + \frac{1}{2\epsilon} (p_+ - p) & \text{if } p_+ - \epsilon \le p \le p_+ + \epsilon;
  \end{cases} \tag{16}
\]

we assume that $\epsilon > 0$ satisfies

\[
  0 \le p_- - \epsilon < p_- + \epsilon < p_+ - \epsilon < p_+ + \epsilon \le 1.
\]

It is clear that this approximation belongs to the Fermi-Sobolev space. An easy
computation shows that (12) and (10) imply

\[
  \left| \sum_{n=1}^{N} (y_n - p_n) f(p_n) \right|
  \le \frac{1}{\sqrt{3}} \sqrt{ \left( \frac{1}{\epsilon} + (p_+ - p_-)^2 \right) N } \tag{17}
\]

for all $N$. We can see that (14), in the form

\[
  \frac{\sum_{n=1}^{N} f(p_n) (y_n - p_n)}{\sum_{n=1}^{N} f(p_n)} \approx 0,
\]

will hold if

\[
  \sum_{n=1}^{N} f(p_n) \gg \sqrt{N}
\]

(roughly, if significantly more than $\sqrt{N}$ forecasts fall in the neighborhood
$[p_-, p_+]$ of $p^*$).
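The "easy computation" behind the explicit calibration inequality can be reproduced numerically. The sketch below assumes that the soft indicator has linear ramps of slope $1/(2\epsilon)$ around $p_-$ and $p_+$ and that the Fermi-Sobolev norm is $\|f\|_{FS}^2 = (\int f)^2 + \int |f'|^2$; under these assumptions $\|f\|_{FS}^2 = 1/\epsilon + (p_+ - p_-)^2$, and with $c_K^2 = 4/3$ the bound $\frac{c_K}{2} \|f\|_{FS} \sqrt{N}$ follows. The interval, the value of $\epsilon$, and the function names are arbitrary illustrative choices.

```python
import math


def soft_indicator(p, p_lo, p_hi, eps):
    """Piecewise-linear approximation to the indicator of [p_lo, p_hi] with ramps of width 2*eps."""
    if p_lo + eps <= p <= p_hi - eps:
        return 1.0
    if p <= p_lo - eps or p >= p_hi + eps:
        return 0.0
    if p < p_lo + eps:
        return 0.5 + (p - p_lo) / (2 * eps)
    return 0.5 + (p_hi - p) / (2 * eps)


def fs_norm_squared(f, grid=100_000):
    """(integral of f)^2 + integral of f'^2 on [0, 1], by the trapezoid rule and finite differences."""
    h = 1.0 / grid
    values = [f(i * h) for i in range(grid + 1)]
    integral = sum((values[i] + values[i + 1]) / 2 for i in range(grid)) * h
    energy = sum(((values[i + 1] - values[i]) / h) ** 2 for i in range(grid)) * h
    return integral ** 2 + energy


p_lo, p_hi, eps = 0.4, 0.6, 0.05
numeric = fs_norm_squared(lambda p: soft_indicator(p, p_lo, p_hi, eps))
closed_form = 1 / eps + (p_hi - p_lo) ** 2


def calibration_bound(n_rounds):
    """(c_K / 2) * ||f|| * sqrt(N) with c_K^2 = 4/3, i.e. sqrt(||f||^2 * N / 3)."""
    return math.sqrt(closed_form * n_rounds / 3)
```

The term $1/\epsilon$ shows the trade-off in the choice of $\epsilon$: a sharper approximation to the crisp indicator has a larger norm and therefore needs more forecasts in the neighborhood before the bound becomes informative.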

It is clear that inequalities analogous to (17) can also be proved for "soft
neighborhoods" of points $(p^*, x^*)$ in $[0, 1] \times X$ (at least when $X$ is a domain
in a Euclidean space), and so Theorem 2 also implies good
calibration-cum-resolution for large $N$. Convenient neighborhoods in $[0, 1] \times [0, 1]^k$ can be
constructed as tensor products of neighborhoods (16).

Inequality (17) and analogous inequalities expressing resolution and
calibration-cum-resolution are explicit in the sense that they do not involve
limits, $o$, $O$, unspecified constants, etc. The price to pay is their relative
complexity; therefore, we also state a simple asymptotic result about
calibration-cum-resolution.

Corollary 1 If $X$ is a compact metric space, some forecasting algorithm
guarantees

\[
  \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n) = 0 \tag{18}
\]

for all continuous functions $f : [0, 1] \times X \to \mathbb{R}$.

Calibration corresponds to the case where $f(p, x) = I_{p^*}(p)$ does not depend
on $x$ and resolution to the case where $f(p, x) = I_{x^*}(x)$ does not depend on $p$.
This result was proved in ^2] in the case of calibration (there are no $x_n$) and
Lipschitz functions $f$.

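Proof of Corollary 1 Let $\mathcal{F}$ be an RKHS on $[0, 1] \times X$ which is universal, i.e.,
dense in the space $C([0, 1] \times X)$, and whose kernel $K$ is continuous and satisfies
$c_K < \infty$. The notion of universality is introduced in [25], Definition 4, and the
existence of such an $\mathcal{F}$ is shown in [26], Theorem 2. For any continuous function
$f : [0, 1] \times X \to \mathbb{R}$ there is a $g \in \mathcal{F}$ that is $\epsilon$-close to $f$ in the metric $C([0, 1] \times X)$,
and so, by (12),

\[
\begin{aligned}
  \limsup_{N \to \infty} \left| \frac{1}{N} \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n) \right|
  &\le \limsup_{N \to \infty} \left| \frac{1}{N} \sum_{n=1}^{N} (y_n - p_n) g(p_n, x_n) \right| + \epsilon \\
  &\le \limsup_{N \to \infty} \frac{c_K}{2 \sqrt{N}} \|g\|_{\mathcal{F}} + \epsilon = \epsilon;
\end{aligned}
\]

since this holds for any $\epsilon > 0$, (18) also holds.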

One of the algorithms achieving (18) for $X = [0, 1]^k$ is K29* applied to the
Fermi-Sobolev kernel (9). It is interesting, and somewhat counterintuitive, that
K29* applied to the Gaussian kernel (15) (with any $\sigma > 0$) also achieves (18):
the universality of the Gaussian kernels is proved in [25] (Example 1).
Our discussion of calibration and resolution in this subsection has been
somewhat speculative, and the reader might ask whether these two properties are
really useful. This question is answered, to some degree, in [3011321, which show
that probability forecasts satisfying these properties lead to good decisions (at
least in the simple decision protocols considered in those papers).



Puzzle of the iterated logarithm 



Theorems 1 and 2 imply that the forecasts produced by the K29* algorithm are
even closer to the actual observations on average than in the case of "genuine
randomness", where Reality produces the data and observations from a
probability distribution on $(X \times \{0, 1\})^\infty$ and each $p_n$ is the conditional probability
that $y_n = 1$ given $x_1, \ldots, x_n$, $y_1, \ldots, y_{n-1}$, and whatever further information
may be available at this point. Indeed, let us take, for simplicity, $\Phi \equiv 1$ (and
$\mathcal{H} := \mathbb{R}$) in Theorem 1. According to the martingale law of the iterated
logarithm (see, e.g., [22] or Chapter 5 of [23), we would expect

\[
  \limsup_{N \to \infty} \frac{\left| \sum_{n=1}^{N} (y_n - p_n) \right|}{\sqrt{2 A_N \ln \ln A_N}} = 1, \tag{19}
\]

where $A_N := \sum_{n=1}^{N} p_n (1 - p_n)$ is assumed to tend to $\infty$ as $N \to \infty$, and so
expect, contrary to (2),

\[
  \sup_{N \in \{1, 2, \ldots\}}
  \frac{\left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2}
       {\sum_{n=1}^{N} p_n (1 - p_n) \left\| \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2}
\]

to be infinite for $p_n$ not consistently very close to $0$ or $1$. Actually, in this case
($\Phi \equiv 1$) Forecaster can even make sure that

\[
  \left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}} \le \frac{1}{2},
  \quad \forall N \in \{1, 2, \ldots\}
\]

(choosing $p_1 := 1/2$ and $p_n := y_{n-1}$, $n = 2, 3, \ldots$).
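The effect of the choice $p_1 := 1/2$, $p_n := y_{n-1}$ can be checked directly: the sum of $y_n - p_n$ telescopes to $y_N - 1/2$, whatever the observations are. The following sketch verifies this in exact arithmetic on a randomly generated binary sequence; the function name and the random sequence are illustrative assumptions.

```python
import random
from fractions import Fraction


def copycat_deviation(ys):
    """Sum of y_n - p_n for p_1 = 1/2 and p_n = y_{n-1}, in exact arithmetic."""
    total = Fraction(0)
    forecast = Fraction(1, 2)          # p_1
    for y in ys:
        total += y - forecast
        forecast = Fraction(y)         # the next forecast repeats the current observation
    return total


rng = random.Random(1)
ys = [rng.randint(0, 1) for _ in range(200)]
deviations = [abs(copycat_deviation(ys[:n])) for n in range(1, len(ys) + 1)]
```

Every entry of `deviations` equals $1/2$, so the cumulative bias stays bounded even though such forecasts are clearly of little practical value, which is part of what makes the puzzle discussed in this subsection puzzling.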

For a general $, we can also expect that the probabilities p n contrived by 
the algorithms of large numbers (K29 or K29*) will have better calibration and 
resolution than the true probabilities. There is, however, little doubt that the 
true probabilities are more useful than any probabilities we are able to come 
up with. The true probabilities are not as good at calibration and resolution, 
so they must be better in some other equally important respects. It remains 
unclear what these other respects may be, and this is what we call the puzzle 
of the iterated logarithm. 



6 Optimality of the K29* algorithm 

In this section we establish that the inequalities in Theorems 1 and 2 are tight,
in a natural sense.
Equation (2) says that the differences $y_n - p_n$ are small on average, even
when scattered in a Hilbert space by multiplying by $\Phi(p_n, x_n)$. The next result
says that it is the best Forecaster can do.

Theorem 3 Let $\Phi : [0, 1] \times X \to \mathcal{H}$, where $\mathcal{H}$ is a Hilbert space. There is a
strategy for Reality II which guarantees that

\[
  \left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2
  \ge \sum_{n=1}^{N} p_n (1 - p_n) \left\| \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2 \tag{20}
\]

always holds for all $N = 1, 2, \ldots$, regardless of what the other players do.
Proof Set

\[
  R_N := \left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}}, \quad N = 1, 2, \ldots
\]

(and $R_0 := 0$); it is sufficient to show that on the $N$th round, $N = 1, 2, \ldots$, Reality II can
ensure that

\[
  R_N^2 - R_{N-1}^2 \ge p_N (1 - p_N) \Phi_N^2, \tag{21}
\]

where

\[
  \Phi_N := \left\| \Phi(p_N, x_N) \right\|_{\mathcal{H}}.
\]

Fix an $N$. Define points $A, C, D \in \mathcal{H}$ as

\[
\begin{aligned}
  C &:= \sum_{n=1}^{N-1} (y_n - p_n) \Phi(p_n, x_n), \\
  A &:= \sum_{n=1}^{N-1} (y_n - p_n) \Phi(p_n, x_n) + (1 - p_N) \Phi(p_N, x_N), \\
  D &:= \sum_{n=1}^{N-1} (y_n - p_n) \Phi(p_n, x_n) + (-p_N) \Phi(p_N, x_N);
\end{aligned}
\]

it is up to Reality II whether to make $R_N$ equal to $|OA|$ or $|OD|$, where $O$ is
the origin. Assuming, without loss of generality, that $R_N = \max(|OA|, |OD|)$,
we reduce our task to showing that the maximal value of $R_{N-1}$ for fixed $R_N$,
$\Phi_N$, and $p_N$ satisfies (21). It is geometrically obvious (see the last paragraph of
this proof for a rigorous argument) that $R_{N-1}$ attains its maximal value when
$|OA| = |OD|$; this is illustrated in Figure 1 (remember that all four points, $O$,
$A$, $C$, and $D$, lie in the same plane). Let $B$ be the base of the perpendicular
dropped from $O$ onto the interval $AD$ and $h := |OB|$. Since the triangles $OBD$
and $OBC$ are right-angled,

\[
\begin{aligned}
  R_N^2 &= h^2 + \left( \tfrac{1}{2} \Phi_N \right)^2, \\
  R_{N-1}^2 &= h^2 + \left( \tfrac{1}{2} \Phi_N - p_N \Phi_N \right)^2.
\end{aligned}
\]

Figure 1: The worst case for Reality II; $|OA| = |OD| = R_N$, $|OC| = R_{N-1}$,
$|AC| = (1 - p_N) \Phi_N$, $|CD| = p_N \Phi_N$, $|OB| = h$.

Subtracting the second equality from the first, we obtain

\[
  R_N^2 - R_{N-1}^2 = \left( \tfrac{1}{2} \Phi_N \right)^2 - \left( \tfrac{1}{2} - p_N \right)^2 \Phi_N^2
  = p_N (1 - p_N) \Phi_N^2.
\]

In conclusion, let us see that the maximum of $R_{N-1}$ is indeed attained when
$|OA| = |OD|$. Assume that $|OA| = R_N$, with $|OD|$ now allowed to be less than
$R_N$. Because of the compactness of the disk in Figure 1 (we are only interested
in two-dimensional subspaces of $\mathcal{H}$, which are isometrically isomorphic to $\mathbb{R}^2$),
the maximum of $|OC|$ is attained at some point $C$. Supposing $|OD| < R_N$, it
is, however, easy to check that no $C$ will be a point of local maximum for $|OC|$;
the least trivial case is perhaps where $O$ lies on the line $AD$ and $C$ is between
$O$ and $D$.
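The strategy for Reality II can be illustrated by a short simulation. The sketch below is our reading of the proof, not code from the paper: on each round Reality II picks the observation that makes the accumulated vector longer (the larger of $|OA|$ and $|OD|$). Forecasts and feature vectors are drawn at random purely for illustration; the feature space is taken to be $\mathbb{R}^3$, and all names are hypothetical.

```python
import random


def dot(u, v):
    """Inner product in the finite-dimensional feature space."""
    return sum(a * b for a, b in zip(u, v))


def reality_two_move(partial_sum, p, phi):
    """Pick y in {0, 1} making partial_sum + (y - p) * phi as long as possible."""
    candidates = {}
    for y in (0, 1):
        moved = [c + (y - p) * f for c, f in zip(partial_sum, phi)]
        candidates[y] = (dot(moved, moved), moved)
    y = max(candidates, key=lambda k: candidates[k][0])
    return y, candidates[y][1]


rng = random.Random(2)
dim = 3
partial_sum = [0.0] * dim      # the running sum of (y_n - p_n) * Phi(p_n, x_n)
variance_sum = 0.0             # the running sum of p_n (1 - p_n) ||Phi(p_n, x_n)||^2
slack = []                     # squared norm of the running sum minus variance_sum
for _ in range(500):
    p = rng.random()                                   # an arbitrary forecast
    phi = [rng.uniform(-1, 1) for _ in range(dim)]     # an arbitrary feature vector
    _, partial_sum = reality_two_move(partial_sum, p, phi)
    variance_sum += p * (1 - p) * dot(phi, phi)
    slack.append(dot(partial_sum, partial_sum) - variance_sum)
```

The slack never becomes negative (up to rounding), in agreement with the per-round inequality: the average of $|OA|^2$ and $|OD|^2$ with weights $p_N$ and $1 - p_N$ equals $|OC|^2 + p_N (1 - p_N) \Phi_N^2$, so the larger of the two is at least that much.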

The next result establishes the tightness of the bound in Theorem 2.

Theorem 4 Let $\mathcal{F}$ be an RKHS on $[0, 1] \times X$ with kernel $K$. Reality II has
a strategy which ensures, regardless of what the other players do, that for each
$N = 1, 2, \ldots$ there exists a non-zero $f \in \mathcal{F}$ such that

\[
  \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n)
  \ge \|f\|_{\mathcal{F}} \sqrt{ \sum_{n=1}^{N} p_n (1 - p_n) K((p_n, x_n), (p_n, x_n)) }. \tag{22}
\]

Proof By Theorem 3 there exists a strategy for Reality II which ensures

\[
  \left\| \sum_{n=1}^{N} (y_n - p_n) K_{p_n, x_n} \right\|_{\mathcal{F}}
  \ge \sqrt{ \sum_{n=1}^{N} p_n (1 - p_n) K((p_n, x_n), (p_n, x_n)) }. \tag{23}
\]

Taking

\[
  f := \sum_{n=1}^{N} (y_n - p_n) K_{p_n, x_n},
\]

we obtain:

\[
\begin{aligned}
  \sum_{n=1}^{N} (y_n - p_n) f(p_n, x_n)
  &= \sum_{n=1}^{N} (y_n - p_n) \langle K_{p_n, x_n}, f \rangle_{\mathcal{F}}
   = \left\langle \sum_{n=1}^{N} (y_n - p_n) K_{p_n, x_n}, f \right\rangle_{\mathcal{F}} \\
  &= \left\| \sum_{n=1}^{N} (y_n - p_n) K_{p_n, x_n} \right\|_{\mathcal{F}} \|f\|_{\mathcal{F}}
   \ge \|f\|_{\mathcal{F}} \sqrt{ \sum_{n=1}^{N} p_n (1 - p_n) K((p_n, x_n), (p_n, x_n)) }.
\end{aligned}
\]

If $f \ne 0$, our task is accomplished. Otherwise, the right-hand side of (23) will
also be zero, and we can take any $f \ne 0$.
A Proof of Theorem 1

The proof of Theorem 1 is based on the game-theoretic approach to the
foundations of probability. A new player, called Skeptic, is added
to the learning protocol of §2; the idea is that Skeptic is allowed to bet at the
odds defined by Forecaster's probabilities. In this proof there is no need to
distinguish between Reality I and Reality II.

Binary Forecasting Game I
Players: Reality, Forecaster, Skeptic
Protocol:
  $\mathcal{K}_0 := C$.
  FOR $n = 1, 2, \ldots$:
    Reality announces $x_n \in X$.
    Forecaster announces $p_n \in [0, 1]$.
    Skeptic announces $s_n \in \mathbb{R}$.
    Reality announces $y_n \in \{0, 1\}$.
    $\mathcal{K}_n := \mathcal{K}_{n-1} + s_n (y_n - p_n)$.
  END FOR.

The protocol describes not only the players' moves but also the changes in
Skeptic's capital $\mathcal{K}_n$; its initial value $\mathcal{K}_0$ is an arbitrary constant $C$.

The crucial (albeit very simple) observation [34] is that for any continuous
strategy for Skeptic there exists a strategy for Forecaster that does not allow
Skeptic's capital to grow, regardless of what Reality is doing (similar
observations were made in ^2]). To state this observation in its strongest form,
we will make Skeptic announce his strategy for each round before Forecaster's
move on that round rather than announce his full strategy at the beginning of
the game. Therefore, we consider the following perfect-information game:

Binary Forecasting Game II
Players: Reality, Forecaster, Skeptic
Protocol:
  $\mathcal{K}_0 := C$.
  FOR $n = 1, 2, \ldots$:
    Reality announces $x_n \in X$.
    Skeptic announces continuous $S_n : [0, 1] \to \mathbb{R}$.
    Forecaster announces $p_n \in [0, 1]$.
    Reality announces $y_n \in \{0, 1\}$.
    $\mathcal{K}_n := \mathcal{K}_{n-1} + S_n(p_n) (y_n - p_n)$.
  END FOR.



Lemma 1 Forecaster has a strategy in Binary Forecasting Game II that ensures
$\mathcal{K}_0 \ge \mathcal{K}_1 \ge \mathcal{K}_2 \ge \cdots$.

Proof Forecaster can use the following strategy to ensure $\mathcal{K}_0 \ge \mathcal{K}_1 \ge \cdots$:

• if $S_n(0)$ and $S_n(1)$ are both positive or both negative, take
  $p_n := (1 + \operatorname{sign} S_n(0))/2$;

• otherwise, choose $p_n$ so that $S_n(p_n) = 0$ (such a $p_n$ will exist).

A measure-theoretic version of Lemma 1 (involving randomization) was
proved in [T§|, Proposition 1.
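Forecaster's strategy in Lemma 1 can be tried out against a concrete Skeptic. In the sketch below Skeptic's continuous bets are an arbitrary made-up family of functions, Reality always picks the observation most favorable to Skeptic, and the root is found by bisection, so the capital can grow only by an amount of the order of the bisection tolerance. All names are hypothetical.

```python
import math


def defensive_forecast(s, tol=1e-12):
    """Forecaster's move against a continuous function s on [0, 1]."""
    s0, s1 = s(0.0), s(1.0)
    if s0 > 0 and s1 > 0:
        return 1.0                     # (1 + sign S(0)) / 2 with sign S(0) = 1
    if s0 < 0 and s1 < 0:
        return 0.0                     # (1 + sign S(0)) / 2 with sign S(0) = -1
    if s0 == 0:
        return 0.0
    if s1 == 0:
        return 1.0
    lo, hi = 0.0, 1.0                  # s(0) and s(1) have opposite signs
    while hi - lo > tol:
        mid = (lo + hi) / 2
        s_mid = s(mid)
        if s_mid == 0:
            return mid
        if (s_mid > 0) == (s0 > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def skeptic_bet(n, mean):
    """An arbitrary continuous bet, depending on the round and on the past observations."""
    return lambda p: math.sin(5 * p + n) + 3 * (mean - p)


capital = [0.0]
past = []
for n in range(1, 201):
    mean = sum(past) / len(past) if past else 0.5
    s = skeptic_bet(n, mean)
    p_n = defensive_forecast(s)
    y_n = 1 if s(p_n) > 0 else 0       # Reality tries to help Skeptic
    capital.append(capital[-1] + s(p_n) * (y_n - p_n))
    past.append(y_n)

largest_gain = max(b - a for a, b in zip(capital, capital[1:]))
```

When the forecast is 0 or 1 the increment $S_n(p_n)(y_n - p_n)$ is non-positive for either observation, and when the forecast is a root the increment vanishes; this is the whole content of the proof.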

Proof of the theorem 

We start by noticing that

\[
  (y_n - p_n)^2 = p_n (1 - p_n) + (1 - 2 p_n) (y_n - p_n) \tag{24}
\]

both for $y_n = 0$ and for $y_n = 1$. Following K29*, Forecaster ensures that Skeptic
will never increase his capital with the strategy

\[
  s_n := \sum_{i=1}^{n-1} K((p_n, x_n), (p_i, x_i)) (y_i - p_i)
       + \frac{1}{2} K((p_n, x_n), (p_n, x_n)) (1 - 2 p_n) \tag{25}
\]

(continuous in $p_n$ by our assumptions). The increase in Skeptic's capital when
he follows (25) is

\[
\begin{aligned}
  \mathcal{K}_N - \mathcal{K}_0
  &= \sum_{n=1}^{N} s_n (y_n - p_n) \\
  &= \sum_{n=1}^{N} \sum_{i=1}^{n-1} K((p_n, x_n), (p_i, x_i)) (y_n - p_n) (y_i - p_i)
   + \frac{1}{2} \sum_{n=1}^{N} K((p_n, x_n), (p_n, x_n)) (1 - 2 p_n) (y_n - p_n) \\
  &= \frac{1}{2} \sum_{n=1}^{N} \sum_{i=1}^{N} K((p_n, x_n), (p_i, x_i)) (y_n - p_n) (y_i - p_i)
   - \frac{1}{2} \sum_{n=1}^{N} K((p_n, x_n), (p_n, x_n)) (y_n - p_n)^2 \\
  &\quad + \frac{1}{2} \sum_{n=1}^{N} K((p_n, x_n), (p_n, x_n)) (1 - 2 p_n) (y_n - p_n) \\
  &= \frac{1}{2} \sum_{n=1}^{N} \sum_{i=1}^{N} K((p_n, x_n), (p_i, x_i)) (y_n - p_n) (y_i - p_i)
   - \frac{1}{2} \sum_{n=1}^{N} K((p_n, x_n), (p_n, x_n)) p_n (1 - p_n)
\end{aligned}
\]

(we used (24) in the last equality). We can rewrite this as

\[
  \mathcal{K}_N - \mathcal{K}_0
  = \frac{1}{2} \left\| \sum_{n=1}^{N} (y_n - p_n) \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2
  - \frac{1}{2} \sum_{n=1}^{N} p_n (1 - p_n) \left\| \Phi(p_n, x_n) \right\|_{\mathcal{H}}^2,
\]

which immediately implies (2).



B Forecast-continuity of feature mappings and 
kernels 

In this appendix we will prove, essentially following [25], Lemma 3, that the
forecast-continuity of a kernel $K$ on $[0, 1] \times X$ is equivalent to the continuity
in $p$ of a feature mapping $\Phi(p, x)$ satisfying (1). As a byproduct, we will also
see that the forecast-continuity of a kernel $K$ on $[0, 1] \times X$ can be equivalently
defined by requiring that

• $K((p, x), (p', x))$ should be continuous in $p$, for all $x \in X$ and all $p' \in [0, 1]$,

• and $K((p, x), (p, x))$ should be continuous in $p$, for all $x \in X$.

In one direction the statement is obvious: if $\Phi(p, x)$ is continuous in $p$, the
continuity of the operation of taking the inner product immediately implies that
$K$ is forecast-continuous, in both senses.

Now suppose that $K$ is forecast-continuous, as defined in the first paragraph
of this appendix (this is the apparently weaker sense of forecast-continuity). To
complete the proof, notice that

\[
\begin{aligned}
  \left\| \Phi(p, x) - \Phi(p_n, x) \right\|_{\mathcal{H}}
  &= \sqrt{ K((p, x), (p, x)) - 2 K((p, x), (p_n, x)) + K((p_n, x), (p_n, x)) } \\
  &\to \sqrt{ K((p, x), (p, x)) - 2 K((p, x), (p, x)) + K((p, x), (p, x)) } = 0
\end{aligned}
\]

when $p_n \to p$ ($n \to \infty$).
C Derivation of the kernel of the Fermi-Sobolev space

We first describe the standard reduction of the problem of finding the kernel of
an RKHS to a variational problem. Let $K$ be the kernel of an RKHS $\mathcal{F}$ on $Z$.

Let $c \in Z$. According to ^5] (Satz III.3), the minimum of $\|f\|_{\mathcal{F}}$ among the
functions $f \in \mathcal{F}$ satisfying $f(c) = 1$ is attained by the function $K(\cdot, c)/K(c, c)$.
Therefore, we obtain a function $k(\cdot, c)$ proportional to $K(\cdot, c)$ by solving the
optimization problem $\|f\|_{\mathcal{F}} \to \min$ under the constraint $f(c) = 1$ (or under
the constraint $f(c) = d$, where $d$ is any other constant). It remains to find the
coefficient of proportionality in terms of $k(\cdot, c)$. If $K(\cdot, c) = a k(\cdot, c)$, we have:

\[
\begin{aligned}
  K(c, c) &= \|K(\cdot, c)\|_{\mathcal{F}}^2, \\
  a k(c, c) &= a^2 \|k(\cdot, c)\|_{\mathcal{F}}^2, \\
  a &= \frac{k(c, c)}{\|k(\cdot, c)\|_{\mathcal{F}}^2}.
\end{aligned}
\]

Therefore, the recipe for finding $K$ is: for each $c \in Z$ solve the optimization
problem $\|f\|_{\mathcal{F}} \to \min$ under the constraint $f(c) = 1$ (the completeness of RKHS
implies that the minimum is attained) and set

\[
  K(\cdot, c) := \frac{k(c, c)}{\|k(\cdot, c)\|_{\mathcal{F}}^2} \, k(\cdot, c), \tag{26}
\]

where $k(\cdot, c)$ is the solution.

Now let us apply this technique to finding the kernel corresponding to the
Fermi-Sobolev space on $[0, 1]$ with the norm given by (7). Let $c \in [0, 1]$ and let
$f$ be the solution to the optimization problem $\|f\|_{\mathcal{F}} \to \min$ under the constraint
$f(c) = 1$ (because of the convexity of the set $\{f \in \mathcal{F} \mid f(c) = 1\}$, there is only
one solution). First we show that the derivative $f'$ is a linear function on $[0, c]$
and on $[c, 1]$, arguing indirectly. Suppose, for concreteness, that $f'$ is not linear
on the interval $(0, c)$; in particular this interval is non-empty. There are three
points $0 < t_1 < t_2 < t_3 < c$ such that

\[
  f'(t_2) \ne \frac{t_3 - t_2}{t_3 - t_1} f'(t_1) + \frac{t_2 - t_1}{t_3 - t_1} f'(t_3). \tag{27}
\]

For a small constant $\epsilon > 0$ (in particular, we assume
$2\epsilon < \min(t_1, t_2 - t_1, t_3 - t_2, c - t_3)$), let $g : [0, 1] \to \mathbb{R}$ be a smooth
function such that $\int_0^1 g(t) \, dt = 0$ and:

• $g(t) = 0$ for $t \le t_1 - \epsilon$;

• $g(t)$ is increasing for $t_1 - \epsilon \le t \le t_1 + \epsilon$;

• $g(t) = t_3 - t_2$ for $t_1 + \epsilon \le t \le t_2 - \epsilon$;

• $g(t)$ is decreasing for $t_2 - \epsilon \le t \le t_2 + \epsilon$;

• $g(t) = -(t_2 - t_1)$ for $t_2 + \epsilon \le t \le t_3 - \epsilon$;

• $g(t)$ is increasing for $t_3 - \epsilon \le t \le t_3 + \epsilon$;

• $g(t) = 0$ for $t \ge t_3 + \epsilon$.

Since, for any $\delta \in \mathbb{R}$ (we are interested in nonzero $\delta$ small in absolute value),

\[
  \|f + \delta g\|_{FS}^2 = \|f\|_{FS}^2 + 2 \delta \int_0^1 f'(t) g'(t) \, dt + \delta^2 \int_0^1 (g'(t))^2 \, dt,
\]

the definition of $f$ implies

\[
  \int_0^1 f'(t) g'(t) \, dt = 0.
\]

However, as $\epsilon \to 0$, the last integral tends to

\[
  f'(t_1) (t_3 - t_2) - f'(t_2) (t_3 - t_1) + f'(t_3) (t_2 - t_1),
\]

which cannot, by (27), be zero.

Once we know that $f$ is a quadratic polynomial to the left and to the right
of $c$, we can easily find (this can be done conveniently using a computer algebra
system) that, ignoring a multiplicative constant,

\[
  f(t) = 3 t^2 + 3 c^2 - 6 c + 8 = 3 t^2 + 3 (1 - c)^2 + 5
\]

to the left of $c$ and

\[
  f(t) = 3 t^2 + 3 c^2 - 6 t + 8 = 3 (1 - t)^2 + 3 c^2 + 5
\]

to the right of $c$. By (26), we can now find

\[
  K(t, c) = \tfrac{1}{2} \min\nolimits^2(t, c) + \tfrac{1}{2} \min\nolimits^2(1 - t, 1 - c) + \tfrac{5}{6},
\]

which agrees with (8).